Blind Data Linkage Using n-gram Similarity Comparisons

Authors

  • Tim Churches
  • Peter Christen
Abstract

Data linkage is the task of linking together information from one or more data sources that represents the same entity (patient, customer, business, gene sequence, etc.). If no unique identifier is available, probabilistic linkage techniques have to be applied. Real-world data is often dirty: it contains missing values, typographical and other errors, different coding schemes and formats, and out-of-date data. Names and addresses are especially prone to data entry errors.
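To make the idea of an n-gram similarity comparison concrete, the following is a minimal sketch of a bigram Dice coefficient between two name strings. It is an illustration under stated assumptions, not necessarily the authors' method: the function names `ngrams` and `dice_similarity` are hypothetical, the Dice coefficient is only one common choice of n-gram measure, and the sketch compares plain-text values, so it does not show the "blind" (privacy-preserving) part of the linkage.

```python
from __future__ import annotations


def ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of character n-grams of a lower-cased string."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over n-gram sets: 2 * |A & B| / (|A| + |B|).

    Assumption for illustration: strings too short to yield any n-gram
    are compared by exact (case-insensitive) equality instead.
    """
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


# A single-character typing variation still leaves shared bigrams,
# so the score stays well above that of unrelated names.
print(dice_similarity("smith", "smyth"))
print(dice_similarity("smith", "jones"))
```

Because the score degrades gradually with each differing character, a threshold on it can tolerate the data entry errors that make exact matching on names and addresses unreliable.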


Similar resources

Some methods for blindfolded record linkage

BACKGROUND: The linkage of records which refer to the same entity in separate data collections is a common requirement in public health and biomedical research. Traditionally, record linkage techniques have required that all the identifying data in which links are sought be revealed to at least one party, often a third party. This necessarily invades personal privacy and requires complete trust ...


DKPro Similarity: An Open Source Framework for Text Similarity

We present DKPro Similarity, an open source framework for text similarity. Our goal is to provide a comprehensive repository of text similarity measures which are implemented using standardized interfaces. DKPro Similarity comprises a wide variety of measures ranging from ones based on simple n-grams and common subsequences to high-dimensional vector comparisons and structural, stylistic, and p...


Genomic Linkage Analysis of Iranian Clinical Isolates of Dermatophytes Fungi Using the RAPD-PCR

Dermatophytes are a group of keratinophilic fungi capable of invading keratinized tissues (skin, hair and nails). They cause dermatophytosis (commonly known as tinea or ringworm) in humans and animals. In this report, DNA similarities and genomic linkage of 40 dermatophyte strains, obtained from different universities, were studied by random amplified polymorphic DNA (RAPD–PCR) using 11 ra...


RTE4: Normalized Dependency Tree Alignment Using Unsupervised N-gram Word Similarity Score

We propose an unsupervised similarity metric to measure the relevance of word pairs using the Web1T data. The alignment scores between the dependency trees of the text and the hypothesis sentences are calculated based on this new similarity metric and these scores are used to predict the entailment between the text and the hypothesis sentences. The new similarity metric together with other feat...


String Metrics and Word Similarity applied to Information Retrieval

Over the past three decades, Information Retrieval (IR) has been studied extensively. The purpose of information retrieval is to assist users in locating information they are looking for. Information retrieval is currently being applied in a variety of application domains from database systems to web information search engines. The main idea of it is to locate documents that contain terms the u...




Publication date: 2004